
Poster: aronsson Date: Mar 9, 2012 12:59am
Forum: texts Subject: OCR of Russian with pre-1917 alphabet (soft-dotted i, yat)

I uploaded some 19th-century Russian books as PDFs of scanned images, set Language: Russian, and got good OCR results. However, the output contains only letters of the modern (post-1917) Russian alphabet, with random garbage in place of the old (pre-1917) letters "soft-dotted i" (І і) and "yat" (Ѣ ѣ). It would be most helpful if these could be added to the recognition set.

Example, based on scans from Runivers.ru:
http://www.archive.org/details/geo_stat_rus_imp_1

Example, based on scans from Google Books:
http://www.archive.org/details/geo_stat_rus_imp_3

About the Russian alphabet:
http://en.wikipedia.org/wiki/Russian_alphabet

The OCR software ABBYY FineReader Professional supports "Russian (old spelling)", so the FineReader (Engine?) version that IA uses presumably does too. But when I specified this as the language for this book,
http://www.archive.org/details/Annensky_Ixion_1902
the process log said "Setting OCR language to English". The result, of course, was garbage.
I then tried setting the language to Ukrainian. With that, the soft-dotted i was recognized correctly, but yat (another pre-1917 Cyrillic letter) was still missing.
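
For anyone who wants to check their own items, here is a minimal sketch in Python of how one might count the two old letters in an OCR text file. The function name and the sample string are my own assumptions for illustration; this is not part of IA's OCR process. The letters are written as Unicode escapes so they cannot be confused with the Latin "I" and "i".

```python
from collections import Counter

# The pre-1917 letters discussed in this post, as Unicode escapes.
OLD_LETTERS = {
    "\u0406": "capital dotted i",
    "\u0456": "small dotted i",
    "\u0462": "capital yat",
    "\u0463": "small yat",
}

def count_old_letters(text):
    """Return how often each pre-1917 letter occurs in text (hypothetical helper)."""
    counts = Counter(ch for ch in text if ch in OLD_LETTERS)
    return {OLD_LETTERS[ch]: n for ch, n in counts.items()}

# Made-up sample in old spelling; it contains one small yat and one small dotted i.
sample = "\u0412\u0463\u0440\u0430 \u0438 \u043c\u0456\u0440\u044a"
print(count_old_letters(sample))
```

If the counts for a pre-1917 book come out empty, the OCR language probably did not include the old letters. A book OCRed as Ukrainian would show dotted i but no yat, as described above.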

This post was modified by aronsson on 2012-03-09 07:39:22

This post was modified by aronsson on 2012-03-09 08:59:11

Poster: aronsson Date: Mar 21, 2012 3:49am
Forum: texts Subject: Re: OCR of Russian with pre-1917 alphabet (soft-dotted i, yat)

Apparently the OCR language has now been set to Russian,RussianOldSpelling, and it works well, thanks! I resubmitted
http://www.archive.org/details/Annensky_Ixion_1902
for new OCR, and the new result is better. You might consider rerunning OCR on other Russian texts.